-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cherry-pick cppc_cpufreq series into 24.04_linux-nvidia-adv-6.8-next #29
Open
jamieNguyenNVIDIA
wants to merge
352
commits into
NVIDIA:24.04_linux-nvidia-adv-6.8-next
Choose a base branch
from
jamieNguyenNVIDIA:jamien/cppcfreq
base: 24.04_linux-nvidia-adv-6.8-next
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Cherry-pick cppc_cpufreq series into 24.04_linux-nvidia-adv-6.8-next #29
jamieNguyenNVIDIA
wants to merge
352
commits into
NVIDIA:24.04_linux-nvidia-adv-6.8-next
from
jamieNguyenNVIDIA:jamien/cppcfreq
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Currently arm_smmu_install_ste_for_dev() iterates over every SID and computes from scratch an identical STE. Every SID should have the same STE contents. Turn this inside out so that the STE is supplied by the caller and arm_smmu_install_ste_for_dev() simply installs it to every SID. This is possible now that the STE generation does not inform what sequence should be used to program it. This allows splitting the STE calculation up according to the call site, which following patches will make use of, and removes the confusing NULL domain special case that only supported arm_smmu_detach_dev(). Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Mostafa Saleh <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Moritz Fischer <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 6554727) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
…_dev() This was needed because the STE code required the STE to be in ABORT/BYPASS inorder to program a cdtable or S2 STE. Now that the STE code can automatically handle all transitions we can remove this step from the attach_dev flow. A few small bugs exist because of this: 1) If the core code does BLOCKED -> UNMANAGED with disable_bypass=false then there will be a moment where the STE points at BYPASS. Since this can be done by VFIO/IOMMUFD it is a small security race. 2) If the core code does IDENTITY -> DMA then any IOMMU_RESV_DIRECT regions will temporarily become BLOCKED. We'd like drivers to work in a way that allows IOMMU_RESV_DIRECT to be continuously functional during these transitions. Make arm_smmu_release_device() put the STE back to the correct ABORT/BYPASS setting. Fix a bug where a IOMMU_RESV_DIRECT was ignored on this path. As noted before the reordering of the linked list/STE/CD changes is OK against concurrent arm_smmu_share_asid() because of the arm_smmu_asid_lock. Tested-by: Shameer Kolothum <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Moritz Fischer <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 8c73c32) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Get closer to the IOMMU API ideal that changes between domains can be hitless. The ordering for the CD table entry is not entirely clean from this perspective. When switching away from a STE with a CD table programmed in it we should write the new STE first, then clear any old data in the CD entry. If we are programming a CD table for the first time to a STE then the CD entry should be programmed before the STE is loaded. If we are replacing a CD table entry when the STE already points at the CD entry then we just need to do the make/break sequence. Lift this code out of arm_smmu_detach_dev() so it can all be sequenced properly. The only other caller is arm_smmu_release_device() and it is going to free the cdtable anyhow, so it doesn't matter what is in it. Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Mostafa Saleh <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Moritz Fischer <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit d2e053d) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
The caller already has the domain, just pass it in. A following patch will remove master->domain. Tested-by: Shameer Kolothum <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Moritz Fischer <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Mostafa Saleh <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit d550ddc) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Introducing global statics which are of type struct iommu_domain, not struct arm_smmu_domain makes it difficult to retain arm_smmu_master->domain, as it can no longer point to an IDENTITY or BLOCKED domain. The only place that uses the value is arm_smmu_detach_dev(). Change things to work like other drivers and call iommu_get_domain_for_dev() to obtain the current domain. The master->domain is subtly protecting the master->domain_head against being unused as only PAGING domains will set master->domain and only paging domains use the master->domain_head. To make it simple keep the master->domain_head initialized so that the list_del() logic just does nothing for attached non-PAGING domains. Tested-by: Shameer Kolothum <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Moritz Fischer <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Mostafa Saleh <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 1b50017) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
The SVA code only works if the RID domain is a S1 domain and has already installed the cdtable. Originally the check for this was in arm_smmu_sva_bind() but when the op was removed the test didn't get copied over to the new arm_smmu_sva_set_dev_pasid(). Without the test wrong usage usually will hit a WARN_ON() in arm_smmu_write_ctx_desc() due to a missing ctx table. However, the next patches wil change things so that an IDENTITY domain is not a struct arm_smmu_domain and this will get into memory corruption if the struct is wrongly casted. Fail in arm_smmu_sva_set_dev_pasid() if the STE does not have a S1, which is a proxy for the STE having a pointer to the CD table. Write it in a way that will be compatible with the next patches. Fixes: 386fa64 ("arm-smmu-v3/sva: Add SVA domain support") Reported-by: Shameerali Kolothum Thodi <[email protected]> Closes: https://lore.kernel.org/linux-iommu/[email protected]/ Tested-by: Nicolin Chen <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit ae91f65) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Move to the new static global for identity domains. Move all the logic out of arm_smmu_attach_dev into an identity only function. Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Moritz Fischer <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 12dacfb) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Using the same design as the IDENTITY domain install an STRTAB_STE_0_CFG_ABORT STE. Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Moritz Fischer <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 352bd64) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Consolidate some more code by having release call arm_smmu_attach_dev_identity/blocked() instead of open coding this. Reviewed-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Moritz Fischer <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit d36464f) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Instead of putting container_of() casts in the internals, use the proper type in this call chain. This makes it easier to check that the two global static domains are not leaking into call chains they should not. Passing the smmu avoids the only caller from having to set it and unset it in the error path. Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Moritz Fischer <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit d8cd200) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Now that the BLOCKED and IDENTITY behaviors are managed with their own domains change to the domain_alloc_paging() op. For now SVA remains using the old interface, eventually it will get its own op that can pass in the device and mm_struct which will let us have a sane lifetime for the mmu_notifier. Call arm_smmu_domain_finalise() early if dev is available. Tested-by: Shameer Kolothum <[email protected]> Tested-by: Nicolin Chen <[email protected]> Tested-by: Moritz Fischer <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 327e10b) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Make pointer to bus_type a pointer to const for code safety. Signed-off-by: Krzysztof Kozlowski <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit e70e9ec) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
The xlate callbacks are supposed to translate of_phandle_args to proper provider without modifying the of_phandle_args. Make the argument pointer to const for code safety and readability. Signed-off-by: Krzysztof Kozlowski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit b42a905) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Make pointer to fwnode_handle a pointer to const for code safety. Signed-off-by: Krzysztof Kozlowski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit 5896e6e) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
iommu_ops_from_fwnode() stores &iommu_spec->np->fwnode in local variable, so use it to simplify the code (iommu_spec is not changed between these dereferences). Signed-off-by: Krzysztof Kozlowski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit b5a1f75) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
The NVIDIA Grace Hopper GPUs have device memory that is supposed to be used as a regular RAM. It is accessible through CPU-GPU chip-to-chip cache coherent interconnect and is present in the system physical address space. The device memory is split into two regions - termed as usemem and resmem - in the system physical address space, with each region mapped and exposed to the VM as a separate fake device BAR [1]. Owing to a hardware defect for Multi-Instance GPU (MIG) feature [2], there is a requirement - as a workaround - for the resmem BAR to display uncached memory characteristics. Based on [3], on system with FWB enabled such as Grace Hopper, the requisite properties (uncached, unaligned access) can be achieved through a VM mapping (S1) of NORMAL_NC and host mapping (S2) of MT_S2_FWB_NORMAL_NC. KVM currently maps the MMIO region in S2 as MT_S2_FWB_DEVICE_nGnRE by default. The fake device BARs thus displays DEVICE_nGnRE behavior in the VM. The following table summarizes the behavior for the various S1 and S2 mapping combinations for systems with FWB enabled [3]. S1 | S2 | Result NORMAL_NC | NORMAL_NC | NORMAL_NC NORMAL_NC | DEVICE_nGnRE | DEVICE_nGnRE Recently a change was added that modifies this default behavior and make KVM map MMIO as MT_S2_FWB_NORMAL_NC when a VMA flag VM_ALLOW_ANY_UNCACHED is set [4]. Setting S2 as MT_S2_FWB_NORMAL_NC provides the desired behavior (uncached, unaligned access) for resmem. To use VM_ALLOW_ANY_UNCACHED flag, the platform must guarantee that no action taken on the MMIO mapping can trigger an uncontained failure. The Grace Hopper satisfies this requirement. So set the VM_ALLOW_ANY_UNCACHED flag in the VMA. Applied over next-20240227. base-commit: 22ba90670a51 Link: https://lore.kernel.org/all/[email protected]/ [1] Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [2] Link: https://developer.arm.com/documentation/ddi0487/latest/ section D8.5.5 [3] Link: https://lore.kernel.org/all/[email protected]/ [4] Cc: Alex Williamson <[email protected]> Cc: Kevin Tian <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Vikram Sethi <[email protected]> Cc: Zhi Wang <[email protected]> Signed-off-by: Ankit Agrawal <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alex Williamson <[email protected]> (cherry picked from commit 81617c1) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
This reverts commit 873aefb. This was a heinous workaround and it turns out it's been fixed in mm twice since it was introduced. Most recently, commit c8070b7 ("mm: Don't pin ZERO_PAGE in pin_user_pages()") would have prevented running up the zeropage refcount, but even before that commit 84209e8 ("mm/gup: reliable R/O long-term pinning in COW mappings") avoids the vfio use case from pinning the zeropage at all, instead replacing it with exclusive anonymous pages. Remove this now useless overhead. Suggested-by: David Hildenbrand <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alex Williamson <[email protected]> (cherry picked from commit 5b99241) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
iommu-dma does not explicitly reference min_align_mask since we already assume that will be less than or equal to any typical IOVA granule. We wouldn't realistically expect to see the case where it is larger, and that would be non-trivial to support, however for the sake of reasoning (particularly around the interaction with SWIOTLB), let's clearly enforce the assumption. Signed-off-by: Robin Murphy <[email protected]> Reviewed-by: Michael Kelley <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/dbb4d2d8e5d1691ac9a6c67e9758904e6c447ba5.1709553942.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit e2addba) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
STRTAB_STE_0_V is a CPU value, it needs conversion for sparse to be clean. The missing annotation was a mistake introduced by splitting the ops out from the STE writer. Fixes: 7da51af ("iommu/arm-smmu-v3: Make STE programming independent of the callers") Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 0493e73) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
STE attributes(NSCFG, PRIVCFG, INSTCFG) use value 0 for "Use Icomming", for some reason SHCFG doesn't follow that, and it is defined as "0b01". Currently the driver sets SHCFG to Use Incoming for stage-2 and bypass domains. However according to the User Manual (ARM IHI 0070 F.b): When SMMU_IDR1.ATTR_TYPES_OVR == 0, this field is RES0 and the incoming Shareability attribute is used. This patch adds a condition for writing SHCFG to Use incoming to be compliant with the architecture, and defines ATTR_TYPE_OVR as a new feature discovered from IDR1. This also required to propagate the SMMU through some functions args. There is no need to add similar condition for the newly introduced function arm_smmu_get_ste_used() as the values of the STE are the same before and after any transition, so this will not trigger any change. (we already do the same for the VMID). Although this is a misconfiguration from the driver, this has been there for a long time, so probably no HW running Linux is affected by it. Reported-by: Will Deacon <[email protected]> Closes: https://lore.kernel.org/all/20240215134952.GA690@willie-the-truck/ Signed-off-by: Mostafa Saleh <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit ec9098d) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
In current code, swiotlb_bounce() may do partial sync's correctly in some circumstances, but may incorrectly fail in other circumstances. The failure cases require both of these to be true: 1) swiotlb_align_offset() returns a non-zero "offset" value 2) the tlb_addr of the partial sync area points into the first "offset" bytes of the _second_ or subsequent swiotlb slot allocated for the mapping Code added in commit 868c9dd ("swiotlb: add overflow checks to swiotlb_bounce") attempts to WARN on the invalid case where tlb_addr points into the first "offset" bytes of the _first_ allocated slot. But there's no way for swiotlb_bounce() to distinguish the first slot from the second and subsequent slots, so the WARN can be triggered incorrectly when NVIDIA#2 above is true. Related, current code calculates an adjustment to the orig_addr stored in the swiotlb slot. The adjustment compensates for the difference in the tlb_addr used for the partial sync vs. the tlb_addr for the full mapping. The adjustment is stored in the local variable tlb_offset. But when NVIDIA#1 and NVIDIA#2 above are true, it's valid for this adjustment to be negative. In such case the arithmetic to adjust orig_addr produces the wrong result due to tlb_offset being declared as unsigned. Fix these problems by removing the over-constraining validations added in 868c9dd. Change the declaration of tlb_offset to be signed instead of unsigned so the adjustment arithmetic works correctly. Tested with a test-only hack to how swiotlb_tbl_map_single() calls swiotlb_bounce(). Instead of calling swiotlb_bounce() just once for the entire mapped area, do a loop with each iteration doing only a 128 byte partial sync until the entire mapped area is sync'ed. Then with swiotlb=force on the kernel boot line, run a variety of raw disk writes followed by read and verification of all bytes of the written data. The storage device has DMA min_align_mask set, and the writes are done with a variety of original buffer memory address alignments and overall buffer sizes. For many of the combinations, current code triggers the WARN statements, or the data verification fails. With the fixes, no WARNs occur and all verifications pass. Fixes: 5f89468 ("swiotlb: manipulate orig_addr when tlb_addr has offset") Fixes: 868c9dd ("swiotlb: add overflow checks to swiotlb_bounce") Signed-off-by: Michael Kelley <[email protected]> Dominique Martinet <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> (cherry picked from commit e8068f2) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
The disable_bypass parameter has been mostly meaningless for a long time since the introduction of default domains. Its original intent is now fulfilled by the controls users have over the default domain type, and its remaining effect in the brief window between Stream Table initialisation and default domain creation hardly seems worth the complication. Furthermore, thanks to 2-level Stream Tables, disabling disable_bypass (there's another reason not to like it right there) has never guaranteed that any particular StreamID *will* bypass anyway - any device which might actually care about that wants RMRs - so there's not really much lost by taking away that option (which has already been non-default for nearing 6 years now). As part of this, also remove the weird behaviour where we "successfully" probe and register a non-functional SMMU if the DT "#iommu-cells" property is wrong. I have no memory of what possessed me to think that was a good idea at the time, and by now I suspect it's likely to break things worse than simply failing probe would. Signed-off-by: Robin Murphy <[email protected]> Reviewed-by: Mostafa Saleh <[email protected]> Link: https://lore.kernel.org/r/ea3ac4cd595a81b5511729601b2f7d4668178438.1712335927.git.robin.murphy@arm.com Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 734554f) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
…ASID The SVA code is wired to assume that the SVA is programmed onto the mm->pasid. The current core code always does this, so it is fine. Add a check for clarity. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit fdc69d3) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
At this point we know which master we are going to change the PCI config on, this is the only device we need to invalidate. Switch arm_smmu_atc_inv_domain() for arm_smmu_atc_inv_master(). Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Moritz Fischer <[email protected]> Reviewed-by: Mostafa Saleh <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 86e5ca0) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Instead of passing a naked __le16 * around to represent a CD table entry wrap it in a "struct arm_smmu_cd" with an array of the correct size. This makes it much clearer which functions will comprise the "CD API". Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Moritz Fischer <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Mostafa Saleh <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit e8e4398) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Existing remove_dev_pasid() callbacks of the underlying iommu drivers get the attached domain from the group->pasid_array. However, the domain stored in group->pasid_array is not always correct in all scenarios. A wrong domain may result in failure in remove_dev_pasid() callback. To avoid such problems, it is more reliable to pass the domain to the remove_dev_pasid() op. Suggested-by: Jason Gunthorpe <[email protected]> Signed-off-by: Yi Liu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit d2f85a2) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Free the IOMMU page tables via iommu_put_pages_list(). The page tables were allocated via iommu_alloc_* functions in architecture specific places, but are released in dma-iommu if the freelist is gathered during map/unmap operations into iommu_iotlb_gather data structure. Currently, only iommu/intel that does that. Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: David Rientjes <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit 95b18ef) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
In order to improve observability and accountability of IOMMU layer, we must account the number of pages that are allocated by functions that are calling directly into buddy allocator. This is achieved by first wrapping the allocation related functions into a separate inline functions in new file: drivers/iommu/iommu-pages.h Convert all page allocation calls under iommu/intel to use these new functions. Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: David Rientjes <[email protected]> Tested-by: Bagas Sanjaya <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit 06c3750) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
…pages.h Convert iommu/io-pgtable-arm.c to use the new page allocation functions provided in iommu-pages.h. Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: David Rientjes <[email protected]> Tested-by: Bagas Sanjaya <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit 9a3dd4c) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Add NR_IOMMU_PAGES into node_stat_item that counts number of pages that are allocated by the IOMMU subsystem. The allocations can be view per-node via: /sys/devices/system/node/nodeN/vmstat. For example: $ grep iommu /sys/devices/system/node/node*/vmstat /sys/devices/system/node/node0/vmstat:nr_iommu_pages 106025 /sys/devices/system/node/node1/vmstat:nr_iommu_pages 3464 The value is in page-count, therefore, in the above example the iommu allocations amount to ~428M. Signed-off-by: Pasha Tatashin <[email protected]> Acked-by: David Rientjes <[email protected]> Tested-by: Bagas Sanjaya <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit bd3520a) Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Grace Hopper/Blackwell systems support the Extended GPU Memory (EGM) feature that enable the GPU to access the system memory within and across nodes. The GPU can allocate the system memory located on the same socket or from a different socket or even on a different node in a multi-node system [1]. The feature is being extended to virtualization. EGM when enabled in the virtualization stack, the host memory is partitioned into two: One partition for the Host OS usage, and a second EGM region that is assigned to the VM. The EGM region essentially becomes the system memory of the VM. The EGM/VM region is not available to the host memory for its usage as it is not added to the kernel. Its base HPA and the length is communicated through the DSDT entries. A linear mapping between the VM IPA and system HPA is a requirement for EGM support. The EGM region is thus assigned to a VM by mapping the QEMU VMA to a linearly increasing HPA of the EGM region using remap_pfn_range(). Patch 1/4 change the KVM code to allow EGM memory to be S2 mapped with executable flag. Patch 2/4 introduce a new nvgrace-egm helper module to nvgrace-gpu to manage the carveout for the VM. The module implements a char device to expose the EGM to usermode apps such as QEMU. The module does a linear mapping of the QEMU VMA to the EGM HPA using remap_pfn range. Patch 3/4 fetches the list of pages known with ECC errors on the EGM memory and expose them through an ioctl. Patch 4/4 registers the EGM memory for ECC poison error handling. Link: https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory [1] Ankit Agrawal (4): KVM: arm64: Allow exec fault on memory mapped cacheable in VMA vfio/nvgrace-egm: Introduce module to manage EGM vfio/nvgrace-egm: Handle pages with ECC errors on the EGM vfio/nvgrace-egm: Register EGM for runtime ECC poison errors handling arch/arm64/kvm/mmu.c | 5 +- drivers/vfio/pci/nvgrace-gpu/Kconfig | 11 + drivers/vfio/pci/nvgrace-gpu/Makefile | 3 + drivers/vfio/pci/nvgrace-gpu/egm.c | 449 ++++++++++++++++++++++++++ drivers/vfio/pci/nvgrace-gpu/egm.h | 12 + drivers/vfio/pci/nvgrace-gpu/main.c | 35 +- include/uapi/linux/egm.h | 26 ++ 7 files changed, 536 insertions(+), 5 deletions(-) create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm.c create mode 100644 drivers/vfio/pci/nvgrace-gpu/egm.h create mode 100644 include/uapi/linux/egm.h -- 2.34.1 Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
nvgrace-egm exposes the API register_egm_node & unregister_egm_node to manage EGM (Extended GPU Memory) present on the system. To allow out-of-tree driver such as nvidia-vgpu-vfio make use of them, move the declaration to a new nvgrace-egm.h in include. Signed-off-by: Ankit Agrawal <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
NVIDIA is planning to productize a new Grace Hopper superchip SKU with device ID 0x2348. Add the SKU devid to nvgrace_gpu_vfio_pci_table. Signed-off-by: Ankit Agrawal <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
file Minor updates to the nvgrace-gpu-vfio-pci module to enable an additional SKU and support out-of-tree drivers using EGM services. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
…ization This adds the following config options to annotations: CONFIG_FAULT_INJECTION=y CONFIG_IOMMUFD=y CONFIG_IOMMUFD_TEST=y CONFIG_IOMMUFD_VFIO_CONTAINER=y CONFIG_VFIO_CONTAINER=n CONFIG_TEGRA241_CMDQV=y CONFIG_IOMMU_IOPF=y CONFIG_NVGRACE_GPU_VFIO_PCI=m CONFIG_NVGRACE_EGM=m CONFIG_FAILSLAB=n CONFIG_FAIL_FUTEX=n CONFIG_FAIL_IO_TIMEOUT=n CONFIG_FAIL_MAKE_REQUEST=n CONFIG_FAIL_PAGE_ALLOC=n CONFIG_FAULT_INJECTION_CONFIGFS=n CONFIG_FAULT_INJECTION_DEBUG_FS=n CONFIG_FAULT_INJECTION_USERCOPY=n CONFIG_SCSI_UFS_FAULT_INJECTION=n Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Add IFC related stuff for data direct. Signed-off-by: Yishai Hadas <[email protected]> Link: https://patch.msgid.link/82da7f578a567909bb5858a64ba844fe4cc298fa.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <[email protected]> (cherry-picked from commit c772a2c linux) Signed-off-by: Tushar Dave <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Jamie Nguyen <[email protected]> Acked-by: Carol L Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Introduce the 'data direct' driver for a ConnectX-8 Data Direct device. The 'data direct' driver functions as the affiliated DMA device for one or more capable mlx5_ib devices. This DMA device, as the name suggests, is used exclusively for DMA operations. It can be considered a DMA engine managed by a PF/VF, lacking network capabilities and having minimal overall capabilities. Consequently, the DMA NIC PF will not be exposed to or directly used by software applications. The driver will not have any direct interface or interaction with the firmware (no command interface, no capabilities, etc.). It will operate solely over PCI to enable its DMA functionality. Registration and un-registration of the driver are handled as part of the mlx5_ib initialization and exit processes, as the mlx5_ib devices will effectively be its clients. The driver will serve as the DMA device for accessing another PCI device to achieve optimal performance (both on the same NUMA node, P2P access, etc.). Upon probing, it will read its VUID over PCI to handle mlx5_ib device registrations with the same VUID. Upon removal, it will notify its clients to allow them to clean up the resources that were mmaped with its DMA device. Signed-off-by: Yishai Hadas <[email protected]> Link: https://patch.msgid.link/b77edecfd476c3f445da96ab6aef499ae47b2829.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <[email protected]> (cherry-picked from commit 6910e36 linux) Signed-off-by: Tushar Dave <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Jamie Nguyen <[email protected]> Acked-by: Carol L Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
…evice Add the NET device initialization flow to utilize the 'data direct' device. When a NET mlx5_ib device is capable of 'data direct', the following sequence of actions will occur: - Find its affiliated 'data direct' VUID via a firmware command. - Create its own private PD and 'data direct' mkey. - Register to be notified when its 'data direct' driver is probed or removed. The DMA device of the affiliated 'data direct' device, including the private PD and the 'data direct' mkey, will be used later during MR registrations that request the data direct functionality. Signed-off-by: Yishai Hadas <[email protected]> Link: https://patch.msgid.link/b11fa87b2a65bce4db8d40341bb6cee490fa4d06.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <[email protected]> (backported from commit 2e8e631 linux) [tdave: Resolve conflict in main.c and mlx5_ib.h] Signed-off-by: Tushar Dave <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Jamie Nguyen <[email protected]> Acked-by: Carol L Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
…ma device Add support for creating pinned DMABUF umem with a specified DMA device instead of the DMA device of the given IB device. This API will be utilized in the upcoming patches of the series when multiple path DMAs are implemented. Signed-off-by: Yishai Hadas <[email protected]> Link: https://patch.msgid.link/038aad36a43797e5591b20ba81051fc5758124f9.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <[email protected]> (cherry-picked from commit 682358f linux) Signed-off-by: Tushar Dave <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Jamie Nguyen <[email protected]> Acked-by: Carol L Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Introduce an option to revoke DMABUF umem. This option will retain the umem allocation while revoking its DMA mapping. Furthermore, any subsequent attempts to map the pages should fail once the umem has been revoked. This functionality will be utilized in the upcoming patches in the series, where we aim to delay umem deallocation until the mkey deregistration. However, we must unmap its pages immediately. Signed-off-by: Yishai Hadas <[email protected]> Link: https://patch.msgid.link/a38270f2fe4a194868ca2312f4c1c760e51bcbff.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <[email protected]> (cherry-picked from commit 253c61d linux) Signed-off-by: Tushar Dave <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Jamie Nguyen <[email protected]> Acked-by: Carol L Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Pass uverbs_attr_bundle as part of '.reg_user_mr_dmabuf' API instead of udata. This enables passing some new ioctl attributes to the drivers, as will be introduced in the next patches for mlx5 driver. Change the involved drivers accordingly. Signed-off-by: Yishai Hadas <[email protected]> Link: https://patch.msgid.link/9a25b2fc02443f7c36c2d93499ae25252b6afd40.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <[email protected]> (cherry-picked from commit 3aa73c6 linux) Signed-off-by: Tushar Dave <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Jamie Nguyen <[email protected]> Acked-by: Carol L Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <[email protected]> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <[email protected]> (backported from commit de8f847 linux) [tdave: resolve conflict in main.c, mr.c and mlx5_user_ioctl_cmds.h , also rework reg_user_mr_dmabuf() to sync with recent upstream rdma changes] Signed-off-by: Tushar Dave <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Jamie Nguyen <[email protected]> Acked-by: Carol L Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Introduce the 'GET_DATA_DIRECT_SYSFS_PATH' ioctl to return the sysfs path of the affiliated 'data direct' device for a given device. Signed-off-by: Yishai Hadas <[email protected]> Link: https://patch.msgid.link/403745463e0ef52adbef681ff09aa6a29a756352.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <[email protected]> (cherry-picked from commit ec7ad65 linux) Signed-off-by: Tushar Dave <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Jamie Nguyen <[email protected]> Acked-by: Carol L Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]>
Ignore: yes Signed-off-by: Brad Figg <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
Signed-off-by: Brad Figg <[email protected]>
The CPPC performance feedback counters could be 0 or unchanged when the target cpu is in a low-power idle state, e.g. power-gated or clock-gated. When the counters are 0, cppc_cpufreq_get_rate() returns 0 KHz, which makes cpufreq_online() get a false error and fail to generate a cpufreq policy. When the counters are unchanged, the existing cppc_perf_from_fbctrs() returns a cached desired perf, but some platforms may update the real frequency back to the desired perf reg. For the above cases in cppc_cpufreq_get_rate(), get the latest desired perf from the CPPC reg to reflect the frequency because some platforms may update the actual frequency back there; if failed, use the cached desired perf. Fixes: 6a4fec4 ("cpufreq: cppc: cppc_cpufreq_get_rate() returns zero in all error cases.") Signed-off-by: Jie Zhan <[email protected]> Reviewed-by: Zeng Heng <[email protected]> Reviewed-by: Ionela Voinescu <[email protected]> Reviewed-by: Huisong Li <[email protected]> Signed-off-by: Viresh Kumar <[email protected]> (cherry picked from commit c471956 linux-next) Signed-off-by: Jamie Nguyen <[email protected]> Tested-by: Carol Soto <[email protected]>
Since commit 6c8d750 ("cpufreq / cppc: Work around for Hisilicon CPPC cpufreq"), we introduce a workround for HiSilicon platforms that do not support performance feedback counters, whereas they can get the actual frequency from the desired perf register. Later on, FIE is disabled in that workaround as well. Now the workround can be handled by the common code. Desired perf would be read and converted to frequency if feedback counters don't change. FIE would be disabled if the CPPC regs are in PCC region. Hence, the workaround is no longer needed and can be safely removed, in an effort to consolidate the driver procedure. Signed-off-by: Jie Zhan <[email protected]> Reviewed-by: Xiongfeng Wang <[email protected]> Reviewed-by: Huisong Li <[email protected]> [ Viresh: Move fie_disabled withing CONFIG option to fix warning ] Signed-off-by: Viresh Kumar <[email protected]> (cherry picked from commit ea1829d linux-next) Signed-off-by: Jamie Nguyen <[email protected]> Tested-by: Carol Soto <[email protected]>
jamieNguyenNVIDIA
requested review from
khfeng,
nvmochs,
KobaKoNvidia and
clsotog
November 1, 2024 22:58
clsotog
approved these changes
Nov 1, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Acked-by: Carol L Soto ([email protected])
Successfully Passed the compilation for amd64 and arm64. |
Applied to the branch and pushed.
From: KobaKoNvidia ***@***.***>
Date: Sunday, November 3, 2024 at 8:09 AM
To: NVIDIA/NV-Kernels ***@***.***>
Cc: Subscribed ***@***.***>
Subject: Re: [NVIDIA/NV-Kernels] Cherry-pick cppc_cpufreq series into 24.04_linux-nvidia-adv-6.8-next (PR #29)
Successfully Passed the compilation for amd64 and arm64.
Acked-by: KobaK ***@***.******@***.***>)
—
Reply to this email directly, view it on GitHub<#29 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AZPREEW6NAL4VQARYMFN5NLZ6ZDCXAVCNFSM6AAAAABRBC766CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINJTGQ4DAMBRGY>.
You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>
|
KobaKoNvidia
approved these changes
Nov 4, 2024
nvidia-bfigg
force-pushed
the
24.04_linux-nvidia-adv-6.8-next
branch
from
December 4, 2024 16:46
855eb5c
to
d7a81ac
Compare
nvidia-bfigg
pushed a commit
that referenced
this pull request
Jan 4, 2025
syzbot reports that a recent fix causes nesting issues between the (now) raw timeoutlock and the eventfd locking: ============================= [ BUG: Invalid wait context ] 6.13.0-rc4-00080-g9828a4c0901f #29 Not tainted ----------------------------- kworker/u32:0/68094 is trying to lock: ffff000014d7a520 (&ctx->wqh#2){..-.}-{3:3}, at: eventfd_signal_mask+0x64/0x180 other info that might help us debug this: context-{5:5} 6 locks held by kworker/u32:0/68094: #0: ffff0000c1d98148 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work+0x4e8/0xfc0 #1: ffff80008d927c78 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x53c/0xfc0 #2: ffff0000c59bc3d8 (&ctx->completion_lock){+.+.}-{3:3}, at: io_kill_timeouts+0x40/0x180 #3: ffff0000c59bc358 (&ctx->timeout_lock){-.-.}-{2:2}, at: io_kill_timeouts+0x48/0x180 #4: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38 #5: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38 stack backtrace: CPU: 7 UID: 0 PID: 68094 Comm: kworker/u32:0 Not tainted 6.13.0-rc4-00080-g9828a4c0901f #29 Hardware name: linux,dummy-virt (DT) Workqueue: iou_exit io_ring_exit_work Call trace: show_stack+0x1c/0x30 (C) __dump_stack+0x24/0x30 dump_stack_lvl+0x60/0x80 dump_stack+0x14/0x20 __lock_acquire+0x19f8/0x60c8 lock_acquire+0x1a4/0x540 _raw_spin_lock_irqsave+0x90/0xd0 eventfd_signal_mask+0x64/0x180 io_eventfd_signal+0x64/0x108 io_req_local_work_add+0x294/0x430 __io_req_task_work_add+0x1c0/0x270 io_kill_timeout+0x1f0/0x288 io_kill_timeouts+0xd4/0x180 io_uring_try_cancel_requests+0x2e8/0x388 io_ring_exit_work+0x150/0x550 process_one_work+0x5e8/0xfc0 worker_thread+0x7ec/0xc80 kthread+0x24c/0x300 ret_from_fork+0x10/0x20 because after the preempt-rt fix for the timeout lock nesting inside the io-wq lock, we now have the eventfd spinlock nesting inside the raw timeout spinlock. Rather than play whack-a-mole with other nesting on the timeout lock, split the deletion and killing of timeouts so queueing the task_work for the timeout cancelations can get done outside of the timeout lock. Reported-by: [email protected] Fixes: 020b40f ("io_uring: make ctx->timeout_lock a raw spinlock") Signed-off-by: Jens Axboe <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Cherry-pick the following series into 24.04_linux-nvidia-adv-6.8-next: https://lore.kernel.org/linux-arm-kernel/[email protected]/